Automatically select packing ratio #622
Conversation
Force-pushed from 9285c85 to 6d53fca
collate_fn=collate_fn,
batch_size=dataloader_batch_size,
drop_last=cfg.drop_last,
# sampler=dist.get_sampler(dataset, # TODO why was this not used in the first return in the original code?
TODO: add the sampler back in.
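A minimal sketch of adding the sampler back, assuming Composer's dist.get_sampler and the surrounding dataset / collate_fn / cfg variables from this function (the shuffle flag is an assumption, not from the diff):

from composer.utils import dist
from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    collate_fn=collate_fn,
    batch_size=dataloader_batch_size,
    drop_last=cfg.drop_last,
    # Re-add the distributed sampler that the packed branch currently omits.
    sampler=dist.get_sampler(dataset,
                             drop_last=cfg.drop_last,
                             shuffle=cfg.dataset.shuffle),
)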
for _, leftover in self.collator._leftover_bins:
    yield leftover


class BinPackCollator:
TODO: remove this class altogether in favor of BinPackDataset; the logic from __call__ should be moved into the BinPackDataset class.
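For reference, a rough sketch (my own, not the PR's implementation) of what a dataset-level packer could look like, assuming an upstream iterable of tokenized, unpadded examples whose values are plain Python lists:

from typing import Dict, Iterable, Iterator, List

from torch.utils.data import IterableDataset

class BinPackDataset(IterableDataset):
    """Greedily concatenates consecutive examples up to max_seq_len."""

    def __init__(self, dataset: Iterable[Dict[str, List[int]]], max_seq_len: int):
        self.dataset = dataset
        self.max_seq_len = max_seq_len

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        bin_: Dict[str, List[int]] = {}
        bin_size = 0
        for example in self.dataset:
            size = len(example['input_ids'])
            if not bin_:
                bin_, bin_size = dict(example), size
            elif bin_size + size <= self.max_seq_len:
                # Append the new example's tokens onto the current bin, key by key.
                for k in bin_:
                    bin_[k] = bin_[k] + example[k]
                bin_size += size
            else:
                yield bin_
                bin_, bin_size = dict(example), size
        if bin_:
            # Flush the last, partially filled bin instead of dropping it.
            yield bin_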
'attention_mask',
'bidirectional_mask',
]

# Cut everything down to size
Remove this comment.
size, trimmed_example = extract_trim_batch_idx(batch, idx)
sizes.append(size)
trimmed_examples.append(trimmed_example)
sizes = [len(example['input_ids']) for example in examples]
Can we assume that we no longer need to trim examples if we pack at the dataset level?
Are datasets always unpadded?
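If padded inputs can still show up, one hedged fallback is to trim on the attention mask before packing; the helper below is illustrative and not part of the PR:

from typing import Dict, Tuple

import torch

def trim_example(example: Dict[str, torch.Tensor]) -> Tuple[int, Dict[str, torch.Tensor]]:
    # Count the real (non-pad) tokens, assuming a 0/1 attention_mask aligned with input_ids.
    size = int(example['attention_mask'].sum())
    trimmed = {k: v[:size] for k, v in example.items()}
    return size, trimmed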
# if k == 'sequence_id':
#     example[k] = torch.cat(
#         [example[k], add_on[k] + 1 + torch.max(example[k])])
TODO: add this back in.
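Restoring the commented-out logic would look roughly like the helper below (the function name and standalone form are mine; the torch.cat expression is the original code):

import torch

def append_sequence_ids(example_ids: torch.Tensor, add_on_ids: torch.Tensor) -> torch.Tensor:
    # Shift the appended ids past the current maximum so each packed sequence keeps a distinct id.
    return torch.cat([example_ids, add_on_ids + 1 + torch.max(example_ids)])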
# min_ratio = 2
# max_ratio = 2
# num_packing_ratios = 1
# profiling_results = profile_packing(dataloader_cfg, tokenizer, min_ratio,
#                                     max_ratio, num_packing_ratios,
#                                     device_batch_size)

# # Obtain the maximum packing_ratio/minimum padding that has no waste.
# i = 0
# waste = 0
# packing_ratio = 1
# while i < len(profiling_results) and waste == 0:
#     packing_ratio, _, waste = profiling_results[i]
#     i += 1
Uncomment and update the min/max ratios as appropriate. I'm probably going to go for something like max_ratio = max_seq_len / 100 and num_packing_ratios = 15.
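Uncommented, with the suggested values plugged in (min_ratio = 1 is my assumption); the selection loop is rewritten to keep the last zero-waste candidate, since the commented-out while loop above would stop one entry late, on the first candidate with waste:

min_ratio = 1
max_ratio = max_seq_len / 100
num_packing_ratios = 15
profiling_results = profile_packing(dataloader_cfg, tokenizer, min_ratio,
                                    max_ratio, num_packing_ratios,
                                    device_batch_size)

# Obtain the maximum packing_ratio/minimum padding that has no waste,
# assuming profiling_results is sorted by packing_ratio ascending.
packing_ratio = 1
for candidate_ratio, _, waste in profiling_results:
    if waste > 0:
        break
    packing_ratio = candidate_ratio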
return batches


def profile(raw_batch_size: int) -> Tuple[float, float]:
    packer = BinPackCollator(
Replace this in favor of BinPackDataset.
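A hedged sketch of what profile() could look like on top of a dataset-level packer like the BinPackDataset sketch above; big_batch and max_seq_len are assumed from the surrounding profiling code, and the padding/waste formulas are my approximations rather than the PR's:

from typing import Tuple

def profile(raw_batch_size: int) -> Tuple[float, float]:
    sample = big_batch[:raw_batch_size]  # unpadded examples to pack
    packed = list(BinPackDataset(iter(sample), max_seq_len=max_seq_len))
    total_tokens = sum(len(ex['input_ids']) for ex in sample)
    kept_tokens = sum(len(ex['input_ids']) for ex in packed)
    # Fraction of pad tokens after padding every packed example to max_seq_len.
    padding = 1 - kept_tokens / (len(packed) * max_seq_len)
    # Fraction of tokens a packer drops; the greedy sketch above never drops any.
    waste = 1 - kept_tokens / total_tokens
    return padding, waste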
assert packed_samples[1] == [7] * 7
assert packed_samples[2] == [6] * 6


# def test_auto_packing():
Add a test for the full auto-packing flow.
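Pending a full dataloader-level test, a hedged sketch that at least covers the ratio-selection step with fake profiling results (the tuple layout follows the selection sketch above, not the actual profile_packing output):

def test_auto_packing_ratio_selection():
    # Fake (packing_ratio, padding, waste) tuples, sorted by packing_ratio.
    profiling_results = [(1.0, 0.5, 0.0), (2.0, 0.3, 0.0), (4.0, 0.1, 0.05)]
    packing_ratio = 1
    for candidate_ratio, _, waste in profiling_results:
        if waste > 0:
            break
        packing_ratio = candidate_ratio
    # The largest zero-waste ratio should be chosen.
    assert packing_ratio == 2.0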
Closing because we decided to stick with the collator version. This may cause small amounts of waste in practice, but it is much simpler to implement and maintain.
Manual test
finetune-auto-pack-BAUz9w https://wandb.ai/mosaic-ml/irene-test/runs/e3pc1puh
finetune-auto-pack-baseline-lvmog9 https://wandb.ai/mosaic-ml/irene-test/runs/vdxwzlxg